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Abstract 

We report a new algorithm for constructing pathways between local 
minima that involve a large number of intervening transition states on the 
potential energy surface. A significant improvement in efficiency has been 
achieved by changing the strategy for choosing successive pairs of local 
minima that serve as endpoints for the next search. We employ Dijkstra's 
algorithm 1 to identify the 'shortest' path corresponding to missing con- 
nections within an evolving database of local minima and the transition 
states that connect them. The metric employed to determine the short- 
est missing connection is a function of the minimised Euclidean distance. 
We present applications to the formation of buckminsterfullerene and to 
the folding of the Bl domain of protein G, tryptophan zippers, and the 

villin headpiece subdomain. The corresponding pathways contain up to 
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163 transition states, and will be used in future discrete path sampling 
calculations. 

1 Introduction 

The discrete path sampling (DPS) approach 2 " 4 provides a means to sample path- 
ways corresponding to 'rare events' using a coarse-grained framework based on 
the underlying potential energy surface (PES). The PES can be formally parti- 
tioned into the catchment basins of local minima, 5 while rate constants for transi- 
tions between these basins can be estimated using statistical rate theory 6-10 once 
transition states on the basin boundaries have been found. Here we adopt the 
geometrical definition of a transition state as a stationary point on the PES with 
a single negative Hessian eigenvalue, following Murrell and Laidler. 11 Minima are 
stationary points with no negative Hessian eigenvalues. 

Two minima may be connected via a single transition state if they are suffi- 
ciently close to each other in configuration space. More generally, local minima 
may interconvert via any number of multi-step paths, and the minimum num- 
ber of elementary steps will depend upon the isomers in question. 3 Here we 
consider an elementary step as a rearrangement that involves a single transi- 
tion state. Locating multi-step pathways on a complex PES can be a difficult 
task. Double-ended methods 12-38 usually do not reveal all the transition states at 
once, mainly because of the existence of multiple barrier height and path length 
scales. 39 Consecutive double-ended searches have proved to be an efficient way 
of working around the above problem, and this strategy was adopted in our pre- 
vious work. 2 ' 38 An essential part of this approach is a mechanism to incorporate 
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the information obtained in all the previous searches into the next one. Various 
strategies can be adopted, the most general and effective being the one based 
on Euclidean separation, and we refer the reader to the original publications for 
details. 2,38 

A guess for the initial pathway is required for most double-ended searches. 
Linear interpolation is a straightforward way to automate this part of the cal- 
culation. 31 ~ 34 ' 38 For large endpoint separations guessing the initial pathway can 
be difficult, and there is a large probability of finding many irrelevant stationary 
points at the beginning of the calculation. 

Due to the discrete nature of the double-ended methods such as NEB 32 ' 33 
and DNEB, 38 which use a series of images to represent the path, the candidate 
transition states obtained from such searches usually need to be optimised fur- 
ther. 36,38 For this purpose we employ eigenvector-following methods. 40-52 The 
connectivity is obtained by following the two unique steepest-descent (SD) paths 
downhill from each transition state. In our calculations the refinement of tran- 
sition states and calculation of approximate steepest-descent paths are the most 
time consuming steps. 

The connection algorithm described in Ref. 38 uses one double-ended search 
per cycle. However, we have found that this approach can be overwhelmed by the 
abundance of stationary points and pathways for complicated rearrangements. 
In the present contribution we introduce the idea of an unconnected pathway 
and make the connection algorithm more focused by allowing more than one 
double-ended search per cycle. This approach, in combination with some other 
modifications described in the next section, greatly reduces the computational 
demands of the method, and has allowed us to tackle more complicated problems. 
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The new algorithm has been implemented within our OPTIM package, and a public 
domain version will be made available for download from the internet. 53 The 
examples presented below each involve the search for an initial connection between 
distant endpoints for use in subsequent DPS calculations. 

2 Methods 

2.1 Outline of the Connection Algorithm 

A detailed description of the previous version of our connection algorithm was 
provided elsewhere. 38 Each iteration consists of one double-ended transition 
state search, followed by refinement of transition state candidates using (hybrid) 
eigenvector-following, 40-52 and calculation of approximate steepest-descent paths 
for each transition state to establish the connectivity. 

We classify all the known local minima into three categories: S minima, which 
are connected to the starting structure, F minima, which are connected to the 
final structure, and U minima, which are so far unconnected to the S and F sets. 
U minima can be connected between themselves but not to any S or F minima. 
When a connection is established between the members of U and S or F sets, the 
unconnected minimum, and all the minima connected to it, become members of 
S or F, respectively. If a connection is found between members of the S and F 
sets then the algorithm terminates. 

Before each cycle a decision must be made as to which minima to try and 
connect next. Various strategies can be adopted, for example, selection based on 
the order in which transition states were found, 2 or, selection of minima with the 
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minimal separation in Euclidean distance space. 38 However, when the endpoints 
are very distant in configuration space, neither of these approaches is particularly 
efficient. The number of possible connections that might be tried simply grows 
too quickly if the S, F and U sets become large. However, the new algorithm 
described below seems to be very effective. 

2.2 A Dijkstra-based Selector 

The modified connection algorithm we have used in the present work is based on 
a shortest path method proposed by Dijkstra. 1 ' 54 We can describe the minima 
that are known at the beginning of each connection cycle as a complete graph, 55 
G = (M, E) , where M is the set of all minima and E is the set of all the edges 
between them. Edges are considered to exist between every pair of minima u and 
v, even if they are in different S, F or U sets, and the weight of the edge is chosen 
to be a function of the minimum Euclidean distance between them: 56 



w(u, v) 



0, if u and v are connected via a single transition state, 

oo, if n{u,v) = n max , 
f(D(u,v)), otherwise, 



(1) 

where n(u, v) is the number of times a pair (u, v) was selected for a connection 
attempt, n max is the maximal number of times we may try to connect any pair 
of minima, and D(u,v) is the minimum Euclidean distance between u and v. / 
should be a monotonically increasing function, such as f(D(u,v)) = D(u,v) 2 . 
We denote the number of minima in the set M = S U U U F, as m, and the 
number of edges in the set E as e = mim — l)/2. 

Using the Dijkstra algorithm 1 ' 54 and the weighted graph representation de- 



scribed above, it is possible to determine the shortest paths between any minima 
in the database. The source is selected to be one of the endpoints. Upon ter- 
mination of the Dijkstra algorithm, a shortest path from one endpoint to the 
other is extracted. If the weight of this pathway is non-zero, it contains one or 
more 'gaps'. Connection attempts are then made for every pair (u, v) of adjacent 
minima in the pathway with non-zero w(u, v) using the DNEB approach. 38 

The computational complexity of the Dijkstra algorithm is at worst 0(m 2 ), 
and the memory requirements scale in a similar fashion. The most appropriate 
data structure is a weighted adjacency matrix. For the calculations presented in 
this paper, the single source shortest paths problem was solved at the beginning 
of each cycle, which took less than 10% of the total execution time for the largest 
database encountered. We emphasise here that once an initial path has been 
found, the perturbations considered in typical DPS calculations will generally 
involve attempts to connect minima that are separated by far fewer elementary 
rearrangements than the endpoints. It is also noteworthy that the initial path is 
unlikely to be contribute significantly to the overall rate constant. Nevertheless, 
it is essential to construct such a path to begin the DPS procedure. 

The nature of the definition of the weight function allows the Dijkstra algo- 
rithm to terminate whenever a second endpoint, or any minimum connected to 
that endpoint via a series of elementary rearrangements, is reached. This obser- 
vation reduces the computational requirements by an amount that depends on 
the distribution of the minima in the database among the S, U and F sets. One 
of the endpoints is always a member of the S set, while the other is a member of 
F set. Either one can be chosen as the source, and we have found it most efficient 
to select the one from the set with fewest members. However, this choice does 
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not improve the asymptotic bounds of the algorithm. 

3 Results 

3.1 Buckminsterfullerene 

Various suggestions 57-75 have been made for the formation mechanism of buckmin- 
sterfullerene 76,77 in the gas phase. In total, there are 1812 different Cqq fullerene 
isomers, 78 ' 79 which probably interconvert via the 'pyracylene' or 'SW rearrange- 
ment. 80 This process has been investigated in several previous studies, 81-88 and 
the most accurate calculations 89 ' 90 yield a picture of the energy landscape that 
is quite similar to that obtained using a tight-binding potential. 86,91 The same 
model 92-94 was therefore used in this initial pathway calculation to minimise the 
computational expense, which is significantly greater than for analytical empiri- 
cal potentials. 95 The present calculation therefore has most in common with the 
annealing study of Xu and Scuseria, 96 although the latter work did not involve 
transition state calculations. 

An initial high-energy starting point was constructed by simply placing sixty 
carbon atoms in a container and minimising the energy. The resulting structure, 
which contains a number of large rings and chains, including polyacetylene frag- 
ments, is shown in Figure ^ Although the structure of the buckminsterfullerene 
endpoint is well known, it is not at all clear which would be the best permutational 
isomer to attempt a connection with, since the endpoints are so different. The 
distance between these structures was minimised with respect to permutation- 
inversion isomers, centre of mass, and orientational coordinates. However, this 
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connection still presents a significant challenge, and required 383 cycles of the 
Dijkstra-based algorithm to achieve a complete path (Figure [TJ. 

An enlarged view of the pathway for the last two steps is also shown in Figure 
^ Both rearrangements correspond to the SW process mentioned above. 80 The 
first step converts a patch containing seven, six, and two five-membered rings 
into a patch with three six-membered rings and one five-membered ring. The 
second step involves a more conventional process linking two patches that both 
contain two six-membered and two five-membered rings. The energy profile in 
this part of the pathway is consistent with the pattern of high barriers previously 
discussed for the low-energy region of the PES. 86 ' 90,91 The present results sug- 
gest that further investigation of paths involving non-fullerene Ceo isomers may 
be worthwhile, although we note once again that this initial path may not be 
dynamically significant. 

3.2 GB1 Hairpin 

The GB1 hairpin consists of residues 41-56 from the C-terminal fragment of the 
Bl domain of protein G. It forms a /5-hairpin both in the complete protein, 97 
and for the isolated fragment in solution. 98 A number of previous experimental 
and theoretical studies have been conducted for this system, 99-111 including a 
DPS investigation. 112 Obtaining an initial path between unfolded and hairpin 
conformations in the latter study was not an easy task, and the new algorithm 
speeded up this part of the calculation by at least a couple of orders of magnitude. 

The CHARMM program 113 has previously been interfaced to our OPTIM code, 53 
which includes a wide variety of algorithms for locating stationary points and 
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characterising pathways, and now includes the Dijkstra-based connection ap- 
proach. As for the previous DPS study we employed the CHARMM19 force 
field with the EEF1 implicit solvation model. 114 The path illustrated in Figure El 
consists of 163 transition states, and was found in 126 cycles of the Dijkstra-based 
algorithm. It connects a partially collapsed non-native structure, with two turns, 
to a conformation from the native hairpin ensemble. 

3.3 Tryptophan Zippers 

Tryptophan zippers are stable fast-folding /3-hairpins designed by Cochran et 
al., 115 which have recently generated considerable interest. 116 ' 117 In the present 
work we have obtained native to denatured state rearrangement pathways for 
five tryptophan zippers: trpzip 1, trpzip 2, trpzip 3, trpzip 3-1 and trpzip 4. The 
notation is adopted from the work of Du et al. 117 All these peptides contain 
twelve residues, except for trpzip 4, which has sixteen. Tryptophan zippers 1, 2, 
3 and 3-1 differ only in the sequence of the turn. Experimental measurements 
of characteristic folding times for these peptides have shed some light on the 
significance of the turn sequence in determining the stability and folding kinetics 
of peptides with the /3-hairpin structural motif. 117 

To model these molecules we used a modified CHARMM19 force field, 113 with 
symmetrised ASN, GLN and TYR dihedral angle and CTER improper dihedral 
angle terms, to ensure that rotamers of these residues have the same energies 
and geometries. 118 Another small modification concerned the addition of a non- 
standard amino acid, D-proline, which was needed to model trpzip 3. The implicit 
solvent model EEF1 was used to account for solvation, 114 with a small change to 
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the original implementation to eliminate discontinuities. 119 

We have used the Dijkstra-based connection algorithm to obtain folding path- 
ways for all five trpzip peptides. In each case the first endpoint was chosen to 
be the native state structure, which, for 1, 2 and 4 trpzips, was taken from the 
Protein Data Bank (PDB). 120 There are no NMR structures available for 3 and 
3-1, so for these peptides the first endpoint was chosen to be the putative global 
minimum obtained using the basin-hopping method. 121-123 The second endpoint 
was chosen to be an extended structure, which was obtained by simply minimis- 
ing the energy of a conformation with all the backbone dihedral angles set to 
180 degrees. All the stationary points (including these obtained during the con- 
nection procedure) were tightly converged to reduce the root-mean-squared force 
below 10~ 10 kcal mol -1 A -1 . 

Each of the five trpzip pathway searches was conducted on a single Xeon 
3.0 GHz CPU and required less than 24 hours of CPU time. The timings could 
certainly be improved by optimising the various parameters employed throughout 
the searches. However, it is more important that the connections actually succeed 
in a reasonably short time. It only requires one complete path to seed a DPS 
run, and we expect the DPS procedure to reduce the length of the initial path 
by a least a factor of two in sampling the largest contributions to the effective 
two-state rate constants. The results of all the trpzip calculations are shown in 
Figure 121 
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3.4 Villin Headpiece Subdomain 

The villin headpiece subdomain is the thermostable 35-residue C-terminal section 
of the headpiece domain of chicken villin protein. The sequence used for the NMR 
structure determination 124 included an additional methionine residue at the N- 
terminus from the expression system; thus, we are considering the 36-residue 
entity here, PDB code 1VII. The structure consists of a bundle of three short 
helices and a closely-packed hydrophobic core. The three helices are numbered 
from the N-terminus to the C-terminus. A turn connects helices 2 and 3, and 
helices 1 and 2 are joined by a loop. 

Its small size and fast folding (a folding time on the order of tens of mi- 
croseconds 125,126 ) make the villin headpiece subdomain an attractive target for 
computational studies, including the 1 /is explicit-water MD simulation of Duan 
and Kollman. 127-139 Our DPS study of the villin headpiece subdomain employed 
the UNRES (united-residue) force field and model, 140-142 in which two interaction 
sites are assigned to each residue: one representing the main chain peptide group 
and one representing the side chain. 

As for the GB1 hairpin, the construction of an initial path with the original 
connection algorithm proved difficult, and a considerable increase in both the 
efficiency and the success rate was attained with the Dijkstra-based strategy. The 
example path illustrated in Figure |U contains 62 transition states and required 
142 cycles of the Dijkstra-based algorithm. In this rearrangement, which connects 
a partially folded minimum to the locally minimised PDB structure, helices 2 and 
3 are completed, the C-terminal residues pack correctly between helices 1 and 3, 
and the three helices adopt their native relative orientations. 
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4 Conclusions 



Discrete path sampling calculations of effective two-state rate constants require 
an initial path consisting of local minima and the transition states that connect 
them. 2-4, 112,143,144 For complex rearrangements the number of elementary steps 
involved may be rather large, and new methods for constructing an initial path 
are needed. This path does not need to be the shortest, or the fastest, but it does 
need to be fully connected. In the present work we have described a connection 
procedure based upon Dijkstra's shortest path algorithm, which enables us to 
select the most promising paths that include missing connections for subsequent 
double-ended searches. We have found that this approach, which is now imple- 
mented within the OPTIM package, 53 enables initial paths containing more than 
a hundred steps to be calculated automatically for a variety of systems. Some 
typical results have been presented for buckminsterfullerene, trpzip peptides, the 
GB1 hairpin, and the villin headpiece subdomain. These paths will be employed 
to seed future discrete path sampling calculations. 
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Figure Captions 

1. Top: connected path between a high-energy Ceo cluster and buckminster- 
fullerene. The path contains 82 transition states, and required 383 cycles 
of the Dijkstra-based connection algorithm, including 1620 DNEB searches. 
V is the potential energy in hartree and s is the integrated path length in 
bohr. Bottom: enlargement of the above plot for the last two SW rearrange- 
ments. The atoms mainly involved in these two steps are shaded, and both 
local minima and transition state structures are indicated at appropriate 
points along the path. 

2. Energy profiles for native to denatured state rearrangements of tryptophan 
zippers found by the Dijksta-based connection algorithm. For each profile 
the number of steps in the pathway, the number of connection algorithm 
cycles, the total number of DNEB searches and the total number of sta- 
tionary points in the database (recorded upon termination of the algorithm) 
are shown. The total number of stationary points is presented in the form 
(n, m), where n is the number of minima and m is the number of transition 
states. The potential energy, V, is given in the units of kcal/mol, and the 
integrated path length, s, is given in the units of A. 

3. Connected path between non-native and /3-hairpin conformations for the 
GB1 peptide. The path contains 163 transition states, and required 126 
cycles of the Dijkstra-based connection algorithm, including 1610 DNEB 
searches. V is the potential energy in kcal/mol and s is the integrated path 
length in A. 
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4. Connected path between non-native and native conformations for the villin 
headpiece subdomain. The two endpoint minima are shown with the N- 
terminal helix (helix 1) on the left. The other two helices and the C- 
terminus are also labelled. The path contains 62 transition states, and 
required 142 cycles of the Dijkstra-based connection algorithm, including 
294 DNEB searches. V is the potential energy in kcal/mol and s is the 
minimised Euclidean distance between consecutive stationary points in A. 
For this system we simply join the stationary points with straight lines. 
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